Pandas and LetsPlot

Published

May 1, 2020

Skill Builder

Pandas and LetsPlot

For this skill builder, we are exploring some important functions in the package of pandas and LetsPlot. DS programming requires a lot of data wrangling. Using the proper functions, we can create concise and comprehensive codes. You should be exposed to a few functions through the readings this week.

You may want to at least scan the readings before beginning this task since this serves as an assessment of your understanding of the assigned readings. This should be able to be finished within 60 minutes. You should work through it on your own or in a group based in your professors instruction.


Data Import

Run the following code to import the data we need for this skill builder:


# package import
import numpy as np
import pandas as pd
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)

# data import
url = 'https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/AER/Guns.csv'
df = pd.read_csv(url)

Make sure the variable df is correctly assigned in your environment and finish the following exercises. You can read the documentation of the data on this page - https://vincentarelbundock.github.io/Rdatasets/doc/AER/Guns.html


Exercise 1

One of the first things we can do to a freshly imported data is to check its columns. This will help us understand the basic structure of the dataframe(table).

Using one line of code, select all the columns in dat, assign it to a variable called col_list.

Every dataframe has an attribute “columns”.
Accessing this attribute will give you a list of all column names

We often want to know the dimension of a dataframe. How many columns are in the dataset? How many rows are in the dataset?

Using one line of code, show the number of columns and rows in df.

Every dataframe has an attribute “shape”.
Accessing this attribute will give you the dimension of a datafarme

Now run df.head(). It will print out the first 5 rows of data in df.

Just from looking at the output, what column(s) seems to be redundant with the row number?

There is one column that serves as nothing but a row counter, that columns is redundant.


Exercise 2

After a brief investigation of the data, we will clean up the data. By cleaning up, we are trying to filter down df so this only holds data we need. We will first get rid of the extra column we found in the previous excercise.

Using one line of code, drop the redundant column using the variable col_list (created in excercise 1)

Use drop().

Understand what “axis” is as a parameter of drop().

Your function should looks like this:

df.drop([col_list[_]], axis = _)

fill the “_“’s with the correct values and assign the output to df.

Don’t forget to save the changes in df. Run df.head() to make sure the column is dropped in df.


Exercise 3

We have filtered df vertically by dropping a column. Now we will try to filter df horizontally, meaning we will get rid of some the rows.

We can do that by applying a condition to df. A condition is an expression that can be evaluated as True/False. For example, 8 > 5 is an expression that evaluates to be True. This is trivial because 8 will always be greater than 5.

Run the code below:

what is the difference between exp1 and exp2?

exp1 = 8 > 5
exp2 = df.violent < 300

Try type() on else variable OR calling else variable.

Run ths code below:

By putting df.violent < 300, and the violent column from df into a dataframe, what is the relationship between the two columns?

exp = pd.DataFrame({"df.violent < 300" : exp2,
                    "violent value from dat" : df.violent})

exp

Try computing df.violent[n] < 300 and (df.violent < 300)[n] where n is less than the number of row. The two expressions will always be the same as long as n is less than the number of rows.

Using query()to filter down the df so that it only contains the data for idaho

query() takes in expressions and filters down data.

Don’t forget to save the changes in df. Run df.shape() to make sure the there are 23 rows and 13 columns.


Exercise 4

Besides filtering, we can manipulate the data by adding new data to it. By adding a new column to the data, we assign a new value to each row.

Using assign(), create a new column that show the ratio between murder rate and violent rate.

Use assign()

You see get the ratio by computing this code:

df.murder/df.violent


Exercise 5

Create a scatter plot that shows the relationship between murder rate and violent rate for the state of Idaho. Your chart should show murder rate as the x-axis, violent as the y-axis.

Can you mimic this plot while using LetsPlot? https://lets-plot.org/


Because You’re Extra

Exercise 6

Using a line of code, filter down the data set so that it only shows the data in years between 1993 and 1997.


Exercise 7

Create a line chart that show prisoners numbers for the state of Idaho, Utah, and Oregon.

Your chart should show year as the x-axis, prisoner as the y-axis, states as different colours, along with an appropriate title.


Exercise 8

Without using query(), finshed the data wrangling in question 2,5 and 6.


See the script.

Back to top